{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "518b983e",
   "metadata": {},
   "source": [
    "# Discord Thread Management\n",
    "\n",
    "This notebook walks through the process of managing documents that come from ever-updating data sources.\n",
    "\n",
    "In this example, we have a directory where the #issues-and-help channel on the LlamaIndex discord is dumped periodically. We want to ensure our index always has the latest data, without duplicating any messages."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a168a3c",
   "metadata": {},
   "source": [
    "## Indexing discord data\n",
    "\n",
    "Discord data is dumped as sequential messages. Every message has useful information such as timestamps, authors, and links to parent messages if the message is part of a thread.\n",
    "\n",
    "The help channel on our discord commonly uses threads when solving issues, so we will group all the messages into threads, and index each thread as it's own document.\n",
    "\n",
    "First, let's explore the data we are working with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "93b6022b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['help_channel_dump_06_02_23.json', 'help_channel_dump_05_25_23.json']\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "print(os.listdir(\"./discord_dumps\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c83dc57",
   "metadata": {},
   "source": [
    "As you can see, we have two dumps from two different dates. Let's pretend we only have the older dump to start with, and we want to make an index from that data.\n",
    "\n",
    "First, let's explore the data a bit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "ac1bcd6f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "JSON keys:  dict_keys(['guild', 'channel', 'dateRange', 'messages', 'messageCount']) \n",
      "\n",
      "Message Count:  5087 \n",
      "\n",
      "Sample Message Keys:  dict_keys(['id', 'type', 'timestamp', 'timestampEdited', 'callEndedTimestamp', 'isPinned', 'content', 'author', 'attachments', 'embeds', 'stickers', 'reactions', 'mentions']) \n",
      "\n",
      "First Message:  If you're running into any bugs, issues, or you have questions as to how to best use GPT Index, put those here! \n",
      "- If it's a bug, let's also track as a GH issue: https://github.com/jerryjliu/gpt_index/issues. \n",
      "\n",
      "Last Message:  Hello there! How can I use llama_index with GPU?\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "with open(\"./discord_dumps/help_channel_dump_05_25_23.json\", \"r\") as f:\n",
    "    data = json.load(f)\n",
    "print(\"JSON keys: \", data.keys(), \"\\n\")\n",
    "print(\"Message Count: \", len(data[\"messages\"]), \"\\n\")\n",
    "print(\"Sample Message Keys: \", data[\"messages\"][0].keys(), \"\\n\")\n",
    "print(\"First Message: \", data[\"messages\"][0][\"content\"], \"\\n\")\n",
    "print(\"Last Message: \", data[\"messages\"][-1][\"content\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ecd4bc3",
   "metadata": {},
   "source": [
    "Conviently, I have provided a script that will group these messages into threads. You can see the `group_conversations.py` script for more details. The output file will be a json list where each item in the list is a discord thread."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "8d082c13",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Done! Written to conversation_docs.json\r\n"
     ]
    }
   ],
   "source": [
    "!python ./group_conversations.py ./discord_dumps/help_channel_dump_05_25_23.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "f5c59130",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Thread keys:  dict_keys(['thread', 'metadata']) \n",
      "\n",
      "{'timestamp': '2023-01-02T03:36:04.191+00:00', 'id': '1059314106907242566'} \n",
      "\n",
      "arminta7:\n",
      "Hello all! Thanks to GPT_Index I've managed to put together a script that queries my extensive personal note collection which is a local directory of about 20k markdown files. Some of which are very long. I work in this folder all day everyday, so there are frequent changes. Currently I would need to rerun the entire indexing (is that the correct term?) when I want to incorporate edits I've made. \n",
      "\n",
      "So my question is... is there a way to schedule indexing to maybe once per day and only add information for files that have changed? Or even just manually run it but still only add edits? This would make a huge difference in saving time (I have to leave it running overnight for the entire directory) as well as cost 😬. \n",
      "\n",
      "Excuse me if this is a dumb question, I'm not a programmer and am sort of muddling around figuring this out 🤓 \n",
      "\n",
      "Thank you for making this sort of project accessible to someone like me!\n",
      "ragingWater_:\n",
      "I had a similar problem which I solved the following way in another world:\n",
      "- if you have a list of files, you want something which says that edits were made in the last day, possibly looking at the last_update_time of the file should help you.\n",
      "- for decreasing the cost, I would suggest maybe doing a keyword extraction or summarization of your notes and generating an embedding for it. Take your NLP query and get the most similar file (cosine similarity by pinecone db should help, GPTIndex also has a faiss) this should help with your cost needs\n",
      " \n",
      "\n"
     ]
    }
   ],
   "source": [
    "with open(\"conversation_docs.json\", \"r\") as f:\n",
    "    threads = json.load(f)\n",
    "print(\"Thread keys: \", threads[0].keys(), \"\\n\")\n",
    "print(threads[0][\"metadata\"], \"\\n\")\n",
    "print(threads[0][\"thread\"], \"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62ab605b",
   "metadata": {},
   "source": [
    "Now, we have a list of threads, that we can transform into documents and index!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee3329af",
   "metadata": {},
   "source": [
    "## Create the initial index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "f05f0266",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import Document\n",
    "\n",
    "# create document objects using doc_id's and dates from each thread\n",
    "documents = []\n",
    "for thread in threads:\n",
    "    thread_text = thread[\"thread\"]\n",
    "    thread_id = thread[\"metadata\"][\"id\"]\n",
    "    timestamp = thread[\"metadata\"][\"timestamp\"]\n",
    "    documents.append(\n",
    "        Document(text=thread_text, id_=thread_id, metadata={\"date\": timestamp})\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "41db23f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex\n",
    "\n",
    "index = VectorStoreIndex.from_documents(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45f98917",
   "metadata": {},
   "source": [
    "Let's double check what documents the index has actually ingested"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "e1e913fa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ref_docs ingested:  767\n",
      "number of input documents:  767\n"
     ]
    }
   ],
   "source": [
    "print(\"ref_docs ingested: \", len(index.ref_doc_info))\n",
    "print(\"number of input documents: \", len(documents))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b605d5d1",
   "metadata": {},
   "source": [
    "So far so good. Let's also check a specific thread to make sure the metadata worked, as well as checking how many nodes it was broken into"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "31c36e9f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RefDocInfo(node_ids=['0c530273-b6c3-4848-a760-fe73f5f8136e'], metadata={'date': '2023-01-02T03:36:04.191+00:00'})\n"
     ]
    }
   ],
   "source": [
    "thread_id = threads[0][\"metadata\"][\"id\"]\n",
    "print(index.ref_doc_info[thread_id])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71689fcb",
   "metadata": {},
   "source": [
    "Perfect! Our thread is rather short, so it was directly chunked into a single node. Furthermore, we can see the date field was set correctly.\n",
    "\n",
    "Next, let's backup our index so that we don't have to waste tokens indexing again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "cbe23135",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Double check ref_docs ingested:  767\n"
     ]
    }
   ],
   "source": [
    "# save the initial index\n",
    "index.storage_context.persist(persist_dir=\"./storage\")\n",
    "\n",
    "# load it again to confirm it worked\n",
    "from llama_index import StorageContext, load_index_from_storage\n",
    "\n",
    "index = load_index_from_storage(StorageContext.from_defaults(persist_dir=\"./storage\"))\n",
    "\n",
    "print(\"Double check ref_docs ingested: \", len(index.ref_doc_info))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55643a4c",
   "metadata": {},
   "source": [
    "## Refresh the index with new data!\n",
    "\n",
    "Now, suddenly we remember we have that new dump of discord messages! Rather than rebuilding the entire index from scratch, we can index only the new documents using the `refresh()` function.\n",
    "\n",
    "Since we manually set the `doc_id` of each index, LlamaIndex can compare incoming documents with the same `doc_id` to confirm a) if the `doc_id` has actually been ingested and b) if the content as changed\n",
    "\n",
    "The refresh function will return a boolean array, indicating which documents in the input were refreshed or inserted. We can use this to confirm that only the new discord threads are inserted!\n",
    "\n",
    "When a documents content has changed, the `update()` function is called, which removes and re-inserts the document from the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "971357d5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "JSON keys:  dict_keys(['guild', 'channel', 'dateRange', 'messages', 'messageCount']) \n",
      "\n",
      "Message Count:  5286 \n",
      "\n",
      "Sample Message Keys:  dict_keys(['id', 'type', 'timestamp', 'timestampEdited', 'callEndedTimestamp', 'isPinned', 'content', 'author', 'attachments', 'embeds', 'stickers', 'reactions', 'mentions']) \n",
      "\n",
      "First Message:  If you're running into any bugs, issues, or you have questions as to how to best use GPT Index, put those here! \n",
      "- If it's a bug, let's also track as a GH issue: https://github.com/jerryjliu/gpt_index/issues. \n",
      "\n",
      "Last Message:  Started a thread.\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "with open(\"./discord_dumps/help_channel_dump_06_02_23.json\", \"r\") as f:\n",
    "    data = json.load(f)\n",
    "print(\"JSON keys: \", data.keys(), \"\\n\")\n",
    "print(\"Message Count: \", len(data[\"messages\"]), \"\\n\")\n",
    "print(\"Sample Message Keys: \", data[\"messages\"][0].keys(), \"\\n\")\n",
    "print(\"First Message: \", data[\"messages\"][0][\"content\"], \"\\n\")\n",
    "print(\"Last Message: \", data[\"messages\"][-1][\"content\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "931c9508",
   "metadata": {},
   "source": [
    "As we can see, the first message is the same as the orignal dump. But now we have ~200 more messages, and the last message is clearly new! `refresh()` will make updating our index easy.\n",
    "\n",
    "First, let's create our new threads/documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "fad09032",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Done! Written to conversation_docs.json\r\n"
     ]
    }
   ],
   "source": [
    "!python ./group_conversations.py ./discord_dumps/help_channel_dump_06_02_23.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "f387c0dc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Thread keys:  dict_keys(['thread', 'metadata']) \n",
      "\n",
      "{'timestamp': '2023-01-02T03:36:04.191+00:00', 'id': '1059314106907242566'} \n",
      "\n",
      "arminta7:\n",
      "Hello all! Thanks to GPT_Index I've managed to put together a script that queries my extensive personal note collection which is a local directory of about 20k markdown files. Some of which are very long. I work in this folder all day everyday, so there are frequent changes. Currently I would need to rerun the entire indexing (is that the correct term?) when I want to incorporate edits I've made. \n",
      "\n",
      "So my question is... is there a way to schedule indexing to maybe once per day and only add information for files that have changed? Or even just manually run it but still only add edits? This would make a huge difference in saving time (I have to leave it running overnight for the entire directory) as well as cost 😬. \n",
      "\n",
      "Excuse me if this is a dumb question, I'm not a programmer and am sort of muddling around figuring this out 🤓 \n",
      "\n",
      "Thank you for making this sort of project accessible to someone like me!\n",
      "ragingWater_:\n",
      "I had a similar problem which I solved the following way in another world:\n",
      "- if you have a list of files, you want something which says that edits were made in the last day, possibly looking at the last_update_time of the file should help you.\n",
      "- for decreasing the cost, I would suggest maybe doing a keyword extraction or summarization of your notes and generating an embedding for it. Take your NLP query and get the most similar file (cosine similarity by pinecone db should help, GPTIndex also has a faiss) this should help with your cost needs\n",
      " \n",
      "\n"
     ]
    }
   ],
   "source": [
    "with open(\"conversation_docs.json\", \"r\") as f:\n",
    "    threads = json.load(f)\n",
    "print(\"Thread keys: \", threads[0].keys(), \"\\n\")\n",
    "print(threads[0][\"metadata\"], \"\\n\")\n",
    "print(threads[0][\"thread\"], \"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "2d737e98",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create document objects using doc_id's and dates from each thread\n",
    "new_documents = []\n",
    "for thread in threads:\n",
    "    thread_text = thread[\"thread\"]\n",
    "    thread_id = thread[\"metadata\"][\"id\"]\n",
    "    timestamp = thread[\"metadata\"][\"timestamp\"]\n",
    "    new_documents.append(\n",
    "        Document(text=thread_text, id_=thread_id, metadata={\"date\": timestamp})\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "f3c97735",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of new documents:  13\n"
     ]
    }
   ],
   "source": [
    "print(\"Number of new documents: \", len(new_documents) - len(documents))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "87c4b886",
   "metadata": {},
   "outputs": [],
   "source": [
    "# now, refresh!\n",
    "refreshed_docs = index.refresh(\n",
    "    new_documents, update_kwargs={\"delete_kwargs\": {\"delete_from_docstore\": True}}\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f85a4fec",
   "metadata": {},
   "source": [
    "By default, if a document's content has changed and it is updated, we can pass an extra flag to `delete_from_docstore`. This flag is `False` by default because indexes can share the docstore. But since we only have one index, removing from the docstore is fine here.\n",
    "\n",
    "If we kept the option as `False`, the document information would still be removed from the `index_struct`, which effectively makes that document invisibile to the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "647025a9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of newly inserted/refreshed docs:  15\n"
     ]
    }
   ],
   "source": [
    "print(\"Number of newly inserted/refreshed docs: \", sum(refreshed_docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c72fa1f",
   "metadata": {},
   "source": [
    "Interesting, we have 13 new documents, but 15 documents were refreshed. Did someone edit their message? Add more text to a thread? Let's find out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "a66882a6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[False, True, False, False, True, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True]\n"
     ]
    }
   ],
   "source": [
    "print(refreshed_docs[-25:])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "196aaa94",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(id_='1110938122902048809', embedding=None, weight=1.0, metadata={'date': '2023-05-24T14:31:28.732+00:00'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='36d308d1d2d1aa5cbfdb2f7d64709644a68805ec22a6053943f985084eec340e', text='Siddhant Saurabh:\\nhey facing error\\n```\\n*error_trace: Traceback (most recent call last):\\n File \"/app/src/chatbot/query_gpt.py\", line 248, in get_answer\\n   context_answer = self.call_pinecone_index(request)\\n File \"/app/src/chatbot/query_gpt.py\", line 229, in call_pinecone_index\\n   self.source.append(format_cited_source(source_node.doc_id))\\n File \"/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py\", line 172, in doc_id\\n   return self.node.ref_doc_id\\n File \"/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py\", line 87, in ref_doc_id\\n   return self.relationships.get(DocumentRelationship.SOURCE, None)\\nAttributeError: \\'Field\\' object has no attribute \\'get\\'\\n```\\nwith latest llama_index 0.6.9\\n@Logan M @jerryjliu98 @ravitheja\\nLogan M:\\nHow are you inserting nodes/documents? That attribute on the node should be set automatically usually\\nSiddhant Saurabh:\\nI think this happened because of the error mentioned by me here https://discord.com/channels/1059199217496772688/1106229492369850468/1108453477081948280\\nI think we need to re-preprocessing for such nodes, right?\\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n')"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_documents[-21]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "d4f95b70",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(id_='1110938122902048809', embedding=None, weight=1.0, metadata={'date': '2023-05-24T14:31:28.732+00:00'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='c995c43873440a9d0263de70fff664269ec70d751c6e8245b290882ec5b656a1', text='Siddhant Saurabh:\\nhey facing error\\n```\\n*error_trace: Traceback (most recent call last):\\n File \"/app/src/chatbot/query_gpt.py\", line 248, in get_answer\\n   context_answer = self.call_pinecone_index(request)\\n File \"/app/src/chatbot/query_gpt.py\", line 229, in call_pinecone_index\\n   self.source.append(format_cited_source(source_node.doc_id))\\n File \"/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py\", line 172, in doc_id\\n   return self.node.ref_doc_id\\n File \"/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py\", line 87, in ref_doc_id\\n   return self.relationships.get(DocumentRelationship.SOURCE, None)\\nAttributeError: \\'Field\\' object has no attribute \\'get\\'\\n```\\nwith latest llama_index 0.6.9\\n@Logan M @jerryjliu98 @ravitheja\\nLogan M:\\nHow are you inserting nodes/documents? That attribute on the node should be set automatically usually\\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n')"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "documents[-8]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "970427f1",
   "metadata": {},
   "source": [
    "Nice! The newer documents contained threads that had more messages. As you can see, `refresh()` was able to detect this and automatically replaced the older thread with the updated text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d22afb4d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-index",
   "language": "python",
   "name": "llama-index"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
